ApplyMomentum
=================
Applies a momentum (optionally Nesterov momentum) optimizer update to a weight tensor.

    .. math::

        \begin{aligned}
        accu_t &= moment \cdot accu_{t-1} + g_t \\
        update_t &= \begin{cases}
            (accu_t \cdot moment + g_t), & \text{if nesterov = True} \\
            accu_t, & \text{otherwise}
        \end{cases} \\
        weight_t &= weight_{t-1} - learning\_rate \cdot update_t
        \end{aligned}
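
    Substituting the first line into the Nesterov branch gives an equivalent form in terms of the previous accumulator. It shows that, compared with the plain branch, the current gradient is weighted by :math:`1 + moment` and the previous accumulator by :math:`moment^2`:

    .. math::

        \begin{aligned}
        update_t &= moment \cdot (moment \cdot accu_{t-1} + g_t) + g_t \\
                 &= moment^2 \cdot accu_{t-1} + (1 + moment) \cdot g_t
        \end{aligned}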


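    Inputs:
        - **weight** - Base address of the weight tensor to update.
        - **accumulate** - Base address of the momentum accumulator tensor (``accu`` in the formula).
        - **gradient** - Base address of the gradient tensor (``g`` in the formula).
        - **learning_rate** - Learning rate.
        - **moment** - Momentum coefficient.
        - **nesterov** - Whether to use Nesterov momentum.
        - **start** - Start index of the range to process, inclusive (shared-storage version).
        - **end** - End index of the range to process, exclusive (shared-storage version).
        - **length** - Number of elements to process (private-storage version).
        - **core_mask(int, optional)** - Core mask (shared-storage version only).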

    Outputs:
        - **weight** - The updated weight tensor, written back in place.
        - **accumulate** - The updated momentum accumulator tensor, written back in place.

    Supported platforms:
        ``FT78NE``
        ``MT7004``

    .. note::
        - FT78NE supports the fp32 data type.
        - MT7004 supports the fp16 and fp32 data types.
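
As a reading aid, the update rule can be written as a plain scalar loop. The sketch below is not the library's implementation: ``ref_apply_momentum`` is a hypothetical name, it handles only fp32, has no ``core_mask``, and runs sequentially on one core. It assumes the half-open range ``[start, end)`` described in the input list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical scalar reference for the formula above; not a library function. */
void ref_apply_momentum(float *weight, float *accumulate,
                        const float *gradient,
                        float learning_rate, float moment,
                        bool nesterov, int start, int end)
{
    assert(start >= 0 && start <= end);
    for (int i = start; i < end; ++i) {
        /* accu_t = moment * accu_{t-1} + g_t */
        float accu = moment * accumulate[i] + gradient[i];
        /* Nesterov branch adds one more momentum step on top of accu_t. */
        float update = nesterov ? accu * moment + gradient[i] : accu;
        /* Both tensors are updated in place, as in the outputs list. */
        accumulate[i] = accu;
        weight[i] -= learning_rate * update;
    }
}
```

A loop like this can serve as a host-side reference when checking the results of ``fp_applymomentum_s`` or ``fp_applymomentum_p``. For fp16 data a tolerance would be needed, since the rounding behavior of the device kernels is not specified here.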


**Shared-storage version:**

.. c:function:: void hp_applymomentum_s(half *weight, half *accumulate, const half *gradient, float learning_rate, float moment, bool nesterov, int start, int end, int core_mask)
.. c:function:: void fp_applymomentum_s(float *weight, float *accumulate, const float *gradient, float learning_rate, float moment, bool nesterov, int start, int end, int core_mask)

    
    **C usage example:**

    .. code-block:: c
        :linenos:
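        :emphasize-lines: 16

        // FT78NE multi-core example
        // Assumes the library header declaring fp_applymomentum_s is also included.
        #include <stdio.h>
        #include <stdbool.h>

        int main(void) {
            float *weight = (float *)0xA0000000;      // DDR memory
            float *accumulate = (float *)0xB0000000;
            float *gradient = (float *)0xC0000000;
            int start = 0;                            // first index, inclusive
            int end = 4096;                           // end index, exclusive
            int core_mask = 0xff;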
            float learning_rate = 1e-2f;
            float moment = 0.99f;
            bool nesterov = false;
            fp_applymomentum_s(weight, accumulate, gradient,
                                learning_rate, moment, nesterov,
                                start, end, core_mask);
            return 0;
        }
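
Because ``start`` is inclusive and ``end`` is exclusive, adjacent sub-ranges can share a boundary value without any element being updated twice. The sketch below shows one way a caller might split a range into consecutive pieces, for example to issue separate calls with different ``learning_rate`` values. ``chunk_range`` is a hypothetical helper, not part of the library, and it ignores any alignment requirements the kernels may have.

```c
#include <assert.h>

/* Hypothetical helper, not a library function: computes the k-th of
 * n_chunks consecutive half-open sub-ranges that together cover [start, end). */
void chunk_range(int start, int end, int n_chunks, int k,
                 int *chunk_start, int *chunk_end)
{
    assert(start <= end && n_chunks > 0 && k >= 0 && k < n_chunks);
    int total = end - start;
    int base = total / n_chunks;   /* minimum number of elements per chunk */
    int rem = total % n_chunks;    /* the first rem chunks get one extra element */
    *chunk_start = start + k * base + (k < rem ? k : rem);
    *chunk_end = *chunk_start + base + (k < rem ? 1 : 0);
}
```

Each ``(chunk_start, chunk_end)`` pair could then be passed as ``start`` and ``end`` to its own ``fp_applymomentum_s`` call. How the library itself distributes a range across the cores selected by ``core_mask`` is not described in this section.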


**Private-storage version:**

.. c:function:: void hp_applymomentum_p(half *weight, half *accumulate, const half *gradient, float learning_rate, float moment, bool nesterov, int length)
.. c:function:: void fp_applymomentum_p(float *weight, float *accumulate, const float *gradient, float learning_rate, float moment, bool nesterov, int length)

    **C usage example:**

    .. code-block:: c
        :linenos:
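        :emphasize-lines: 14

        // MT7004 single-core example
        // Assumes the library header declaring hp_applymomentum_p and the half type is included.
        #include <stdio.h>
        #include <stdbool.h>

        int main(void) {
            half *weight = (half *)0x10000000;       // L2 memory
            half *accumulate = (half *)0x10002000;
            half *gradient = (half *)0x10004000;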
            int length = 2048;
            float learning_rate = 5e-3f;
            float moment = 0.9f;
            bool nesterov = true;
            hp_applymomentum_p(weight, accumulate, gradient,
                               learning_rate, moment, nesterov,
                               length);
            return 0;
        }